Abstract: In this paper we investigate a technique for deriving
vocal source features from the LP residual of the speech signal
for automatic speaker identification. The features are obtained by
computing the autocorrelation of the residual signal over a specific
range of lags. Whereas traditional features such as MFCC and PLPCC
represent vocal tract information, these features represent
complementary vocal cord information. Our experiments show that
fusing these two sources of information to represent speaker
characteristics yields better speaker identification accuracy. We
use Gaussian mixture model (GMM) based speaker modeling, and
results on two public databases are shown to validate our
proposition.

Index Terms: Speaker Identification, Feature Extraction,
Autocorrelation, Residual Signal, Complementary Feature.



I. Introduction 

Speaker identification [1]-[3] is the task of determining the
identity of an unknown subject from his or her voice. It requires
a robust feature extraction technique followed by a modeling
approach. During feature extraction, the raw speech signal
undergoes several dimensionality reduction operations whose
outputs are more compact and robust than the original speech.
Although the speech signal is non-stationary, it exhibits short-term
stationarity over intervals of 20-40 ms [4], during which the vocal
tract characteristics are almost static. Most feature extraction
techniques are therefore based on short-term spectral analysis of
the speech signal. The central idea is to capture information
related to the formant frequencies and their characteristics, such
as magnitude, bandwidth, and slope, through various spectrum
estimation techniques. The shortcoming of these techniques is that
they neglect the vocal cord (or vocal fold) characteristics, which
also carry significant speaker-specific information. The residual
signal obtained through linear prediction (LP) analysis of speech
contains information related to the vocal cord. Several approaches
have been investigated to find and model the speaker-specific
characteristics of this residual signal. Auto-associative neural
networks (AANN) [5], wavelet octave coefficients of residue
(WOCOR) [6], residual phase [7], and higher order cumulants [8]
have been employed earlier to reduce the equal error rate (EER) of
speaker recognition systems. In this work we use the
autocorrelation method to find speaker-specific traits in the
residual. The contribution of this residual feature is fused with
a spectral feature based system. We have conducted experiments on
two popular publicly available speaker identification databases
(POLYCOST and YOHO) using a GMM based classifier. It is observed
that the performance of speaker identification systems based on
various spectral features improves in combined mode for different
GMM model orders. The rest of the paper is organized as follows.
In Section II we briefly review basic LP analysis and then present
the proposed feature extraction technique. The speaker
identification experiments and results are presented in Section
III. Finally, the paper is concluded in Section IV.

II. Feature Extraction From Residual Signal 

A. Linear Prediction Analysis and Residual Signal 

In the LP model, the (n-1)-th to (n-p)-th samples of the
speech wave (n, p are integers) are used to predict the n-th
sample. The predicted value of the n-th speech sample [9] is
given by

$$\hat{s}(n) = \sum_{k=1}^{p} a(k)\, s(n-k) \tag{1}$$

where $\{a(k)\}_{k=1}^{p}$ are the predictor coefficients and $s(n)$ is
the n-th speech sample. The value of p is chosen such that it 
could effectively capture the real and complex poles of the 
vocal tract in a frequency range equal to half the sampling 
frequency. The prediction coefficients (PC) are determined by 
minimizing the mean square prediction error [JJ and the error 
is defined as 



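$$E = \frac{1}{N} \sum_{n=0}^{N-1} \left( s(n) - \sum_{k=1}^{p} a(k)\, s(n-k) \right)^{2} \tag{2}$$

where the summation is taken over all N samples. The set of
coefficients $\{a(k)\}_{k=1}^{p}$ that minimizes the mean-squared
prediction error is obtained as the solution of the set of linear
equations

$$\sum_{k=1}^{p} \phi(j,k)\, a(k) = \phi(j,0), \quad j = 1, 2, 3, \ldots, p$$

where

$$\phi(j,k) = \frac{1}{N} \sum_{n=0}^{N-1} s(n-j)\, s(n-k) \tag{3}$$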
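
As a brief check of how these linear equations arise, the following sketch assumes the mean-squared error is written with a 1/N scaling (an assumption about the normalization; any constant factor cancels) and sets its derivative with respect to each a(j) to zero:

```latex
\begin{align*}
E &= \frac{1}{N}\sum_{n=0}^{N-1}\Big(s(n)-\sum_{k=1}^{p}a(k)\,s(n-k)\Big)^{2},\\
\frac{\partial E}{\partial a(j)}
  &= -\frac{2}{N}\sum_{n=0}^{N-1}s(n-j)\Big(s(n)-\sum_{k=1}^{p}a(k)\,s(n-k)\Big)=0,\\
\sum_{k=1}^{p}a(k)\,
  \underbrace{\frac{1}{N}\sum_{n=0}^{N-1}s(n-j)\,s(n-k)}_{\phi(j,k)}
  &= \underbrace{\frac{1}{N}\sum_{n=0}^{N-1}s(n-j)\,s(n)}_{\phi(j,0)},
  \qquad j=1,\ldots,p.
\end{align*}
```

The last line is one equation per value of j, which gives p linear equations in the p unknown coefficients.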
The PCs, $\{a(k)\}_{k=1}^{p}$, are derived by solving this set of
equations recursively.

Using $\{a(k)\}_{k=1}^{p}$ as model parameters, equation (5)
represents the fundamental basis of the LP representation. It
implies that any signal can be defined by a linear predictor
and its prediction error:

$$s(n) = \sum_{k=1}^{p} a(k)\, s(n-k) + e(n) \tag{5}$$

The LP transfer function can be defined as

$$H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}} \tag{6}$$

where G is the gain scaling factor for the present input and
A(z) is the p-th order inverse filter. The LP coefficients
themselves can be used for speaker recognition, as they contain
speaker-specific information such as the vocal tract resonance
frequencies and their bandwidths. However, nonlinear
transformations are usually applied to the PCs to improve
robustness. Linear prediction cepstral coefficients (LPCC), line
spectral pair frequencies (LSF), and log-area ratios (LAR) are
such representations of the LPC.

The prediction error e(n) is called the residual signal, and
it contains the complementary information that is not contained
in the PCs. It is worth mentioning that the residual signal
conveys vocal source cues such as the fundamental frequency and
pitch period.
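
The LP analysis described in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the autocorrelation method with the Levinson-Durbin recursion for the recursive solution, and the function names (`autocorr`, `levinson_durbin`, `lp_residual`) are chosen here for illustration only.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame x (no scaling)."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for a(1..p).

    Sign convention follows the text: s_hat(n) = sum_k a(k) s(n-k).
    Assumption: autocorrelation method; the source does not show its recursion.
    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        if err <= 0.0:           # degenerate frame: leave remaining a(k) at zero
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err          # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


def lp_residual(s, a):
    """Residual e(n) = s(n) - sum_k a(k) s(n-k), with s(n) = 0 for n < 0."""
    p = len(a)
    return [s[n] - sum(a[k] * s[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(s))]


# Usage on a synthetic first-order decaying sequence (hypothetical data).
frame = [0.9 ** n for n in range(200)]
coeffs, _ = levinson_durbin(autocorr(frame, 2), 2)
residual = lp_residual(frame, coeffs)
```

For this synthetic frame the predictor recovers a(1) close to 0.9 and a(2) close to 0, so the residual is essentially the initial excitation sample followed by near-zero values.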

B. Autocorrelation of Residual Signal 

The autocorrelation measures the similarity of a signal with
itself. It is related to the power spectral density (PSD) through
the Fourier transform. In speech processing, the autocorrelation
function is mostly used for estimating the pitch of the signal
and in LP based speech analysis. In pitch detection algorithms,
the pitch period (the reciprocal of the fundamental frequency) is
estimated as the distance between two consecutive peaks of the
autocorrelation function. In LP analysis, on the other hand, the
autocorrelation function captures the second order relationship
among the speech samples, as described in Sec. II-A.
The autocorrelation of a discrete signal x(n) of length N is
given by

$$r(n) = \sum_{k=-N}^{N} x(k)\, x(k-n) \tag{7}$$

In eqn. (7), we consider the full correlation over shifts from
-N to N. The correlation can also be computed over a limited range
of lags to check only the short-time similarity. If the maximum
and minimum lags are bounded in magnitude by L, the correlation
can be calculated as in eqn. (8):

$$r_l(n) = \sum_{k=-L}^{L} x(k)\, x(k-n) \tag{8}$$

Higher-lag autocorrelation of the speech signal was employed
earlier for a robust speech recognition engine [10]. In our
proposed feature extraction method, we compute the autocorrelation
function of the residual signal for a specific range of lags.



First, the residual signal is normalized to lie between -1 and +1
and to have zero mean (first order moment). This process also
helps to reduce unwanted variation in the autocorrelation. Second,
we compute the autocorrelation and keep only the upper half (the
non-negative lags), since the correlation function is symmetric.
The zero-lag term $r_l(0)$ is also included because it is related
to the stress (energy) of a particular speech frame's residual. If
we consider lags in [-L, L], the total number of coefficients
becomes L + 1. The residual feature extracted through this
technique is referred to as ACRLAG throughout this paper.
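
The two steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: the exact normalization is not specified in the text, so mean removal followed by peak scaling is an assumption, and the name `acrlag` and its parameters are illustrative.

```python
def acrlag(residual, num_lags=12):
    """Return autocorrelation values for lags 0..num_lags of one residual frame.

    Assumption: "normalized between -1 and +1 with zero mean" is implemented
    here as mean removal followed by division by the peak magnitude.
    """
    n = len(residual)
    if n == 0:
        raise ValueError("residual frame must not be empty")
    mean = sum(residual) / n
    centered = [v - mean for v in residual]
    peak = max(abs(v) for v in centered)
    if peak > 0.0:
        centered = [v / peak for v in centered]
    # Keep only the non-negative lags; lag 0 carries the frame energy.
    return [sum(centered[i] * centered[i - lag] for i in range(lag, n))
            for lag in range(num_lags + 1)]


# Usage on a hypothetical 160-sample frame: 12 lags give a 13-dimensional vector.
features = acrlag([1.0, -1.0] * 80, num_lags=12)
```

With L = 12 the function returns L + 1 = 13 values, matching the feature dimension quoted for ACRLAG.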

C. Integration of Complementary Information 

In this section we propose to integrate vocal tract and vocal
cord parameters for identifying speakers. Although the two
approaches differ significantly in performance, the ways in which
they represent the speech signal are complementary to one another.
Hence, combining the advantages of both features is expected to
improve [11] the overall performance of the speaker identification
system. The block diagram of the combined system is shown in
Fig. 1. Spectral features and residual features are extracted from
the training data in two separate streams. Speaker modeling is
then performed for the respective features independently, and the
model parameters are stored in the model database. At the time of
testing, the same process is adopted for feature extraction. The
log-likelihoods of the two different features are computed w.r.t.
their corresponding models. Finally, the output scores are
weighted and combined.

To exploit the advantages of both systems and their
complementarity, we use score level linear fusion, which can be
formulated as in equation (9):

$$\mathrm{LLR}_{combined} = \eta\, \mathrm{LLR}_{spectral} + (1 - \eta)\, \mathrm{LLR}_{residual} \tag{9}$$

where $\mathrm{LLR}_{spectral}$ and $\mathrm{LLR}_{residual}$ are the
log-likelihood ratios calculated from the spectral and residual
based systems, respectively. The fusion weight is decided by the
parameter $\eta$.
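
The score level fusion can be illustrated with a short sketch. The function names and the example scores below are hypothetical; only the weighted sum itself comes from the text.

```python
def fuse_scores(llr_spectral, llr_residual, eta=0.5):
    """Weighted sum of per-speaker scores from the two streams."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if len(llr_spectral) != len(llr_residual):
        raise ValueError("both systems must score the same speakers")
    return [eta * a + (1.0 - eta) * b
            for a, b in zip(llr_spectral, llr_residual)]


def identify(llr_spectral, llr_residual, eta=0.5):
    """Index of the speaker with the highest fused score."""
    fused = fuse_scores(llr_spectral, llr_residual, eta)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three enrolled speakers.
spectral = [-10.0, -12.0, -11.0]
residual = [-30.0, -20.0, -29.0]
winner = identify(spectral, residual, eta=0.5)
```

In this made-up example the spectral system alone prefers speaker 0, while the fused score prefers speaker 1, which shows how the residual stream can change the decision.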

III. Speaker Identification Experiment 

A. Experimental Setup 

1) Pre-processing stage: In this work, the pre-processing stage
is kept the same across the different feature extraction methods.
It is performed using the following steps:

• Silence removal and end-point detection are done using an
energy threshold criterion.

• The speech signal is then pre-emphasized with a pre-emphasis
factor of 0.97.

• The pre-emphasized speech signal is segmented into
frames of 20 ms each with 50% overlap, i.e. the total
number of samples in each frame is N = 160 (sampling
frequency $F_s$ = 8 kHz).

• In the last step of pre-processing, each frame is windowed
using the Hamming window given by

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) \tag{10}$$

where N is the length of the window.
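
The pre-emphasis, framing and windowing steps above can be sketched as follows. This is a minimal illustration with illustrative function names; silence removal is omitted because the energy threshold is not specified in the text.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis: y(n) = x(n) - alpha * x(n-1)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, frame_len=160, hop=80):
    """Split into windowed frames: 20 ms at 8 kHz with 50% overlap."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


# Usage on a hypothetical 400-sample signal (50 ms at 8 kHz).
frames = frame_signal(pre_emphasize([1.0] * 400))
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding the last frame would be an equally reasonable choice.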
Fig. 1. Block diagram of Fusion Technique: Score level fusion of Vocal tract (short term spectral based feature) and Vocal cord information (Residual).



2) Classification & Identification stage: The GMM technique
is used to obtain a probabilistic model for the feature vectors
of a speaker. The idea of GMM is to use a weighted sum of
multivariate Gaussian functions to represent the probability
density of the feature vectors, and it is given by

$$p(\mathbf{x}) = \sum_{i=1}^{M} p_i\, b_i(\mathbf{x}) \tag{11}$$

where $\mathbf{x}$ is a d-dimensional feature vector, $b_i(\mathbf{x})$,
i = 1, ..., M are the component densities and $p_i$, i = 1, ..., M
are the mixture weights or priors of the individual Gaussians.
Each component density is given by

$$b_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{t}\, \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\} \tag{12}$$

with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$.
The mixture weights must satisfy the constraints
$\sum_{i=1}^{M} p_i = 1$ and $p_i \geq 0$. The GMM is parameterized
by the means, covariances and mixture weights of all component
densities and is denoted by

$$\lambda = \{p_i, \boldsymbol{\mu}_i, \Sigma_i\}_{i=1}^{M} \tag{13}$$

In SI, each speaker is represented by a GMM and is referred to
by his/her model $\lambda$. The parameters of $\lambda$ are optimized
using the expectation maximization (EM) algorithm [12]. In these
experiments, the GMMs are trained with 10 iterations, where the
clusters are initialized by the vector quantization [13] algorithm.

In the identification stage, the log-likelihood score of the
feature vectors of the utterance under test is calculated by

$$\log p(X|\lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t|\lambda) \tag{14}$$

where $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$ is the
set of feature vectors of the test utterance.

In the closed-set SI task, an unknown utterance is identified
as an utterance of the speaker whose model gives the maximum
log-likelihood. It can be written as

$$\hat{S} = \arg\max_{1 \leq k \leq S} \sum_{t=1}^{T} \log p(\mathbf{x}_t|\lambda_k) \tag{15}$$

where $\hat{S}$ is the identified speaker from the speakers' model set
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_S\}$ and S is the
total number of speakers.
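
The scoring and decision rule can be sketched as below. This is an illustration under stated assumptions: diagonal covariance matrices are assumed for brevity (the text does not say whether full or diagonal covariances were used), mixture weights are assumed strictly positive, and all names and the toy models are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a Gaussian density with diagonal covariance (assumption)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def gmm_log_likelihood(frames, model):
    """Sum over frames of log p(x_t | model).

    model is a list of (weight, mean, var) tuples with positive weights.
    The log-sum-exp trick avoids underflow for unlikely frames.
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v)
                 for w, m, v in model]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def identify_speaker(frames, models):
    """Index of the speaker model with the maximum log-likelihood."""
    scores = [gmm_log_likelihood(frames, m) for m in models]
    return max(range(len(models)), key=scores.__getitem__)


# Two hypothetical one-dimensional speaker models and a short test utterance.
speaker_a = [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])]
speaker_b = [(0.5, [5.0], [1.0]), (0.5, [6.0], [1.0])]
decision = identify_speaker([[4.8], [5.1]], [speaker_a, speaker_b])
```

Working in the log domain keeps the sum over T frames numerically stable, which matters for utterances with many frames.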
3) Databases for experiments: 

YOHO Database: The YOHO voice verification corpus
[1], [14] was collected while testing ITT's prototype speaker
verification system in an office environment. Most subjects 
were from the New York City area, although there were many 
exceptions, including some non-native English speakers. A 
high-quality telephone handset (Shure XTH-383) was used to 
collect the speech; however, the speech was not passed through 
a telephone channel. There are 138 speakers (106 males and 32
females); for each speaker, there are 4 enrollment sessions of
24 utterances each and 10 test sessions of 4 utterances each. In
this work, a closed-set text-independent speaker identification
problem is attempted where we consider all 138 speakers
as client speakers. For each speaker, all 96 (4 sessions ×
24 utterances) utterances are used for developing the speaker
model, while 40 (10 sessions × 4 utterances) utterances are put
under test. Therefore, for 138 speakers we put
138 × 40 = 5520 utterances under test and evaluate the
identification accuracy.

POLYCOST Database: The POLYCOST database [15] was
recorded as a common initiative within the COST 250 action
during January-March 1996. It contains around 10 sessions
recorded by 134 subjects from 14 countries. Each session
consists of 14 items, two of which (MOT01 & MOT02 files)
contain speech in the subject's mother tongue. The database
was collected through the European telephone network. The
recording was performed with ISDN cards on two XTL
SUN platforms at an 8 kHz sampling rate. In this work, a
closed-set text-independent speaker identification problem is
addressed where only the mother tongue (MOT) files are used.
The specified guideline [15] for conducting closed-set speaker
identification experiments is adhered to, i.e. 'MOT02' files
from the first four sessions are used to build a speaker model
while 'MOT01' files from session five onwards are taken for
testing. As with the YOHO database, all speakers (131 after
deletion of three speakers) in the database were registered as
clients.

4) Score Calculation: In the closed-set speaker identification
problem, the identification accuracy as defined in [16] and given
by equation (16) is followed.

$$\text{Percentage of identification accuracy (PIA)} = \frac{\text{No. of utterances correctly identified}}{\text{Total no. of utterances under test}} \times 100 \tag{16}$$
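
The accuracy measure is a simple ratio; a minimal sketch with an illustrative function name and made-up speaker labels:

```python
def identification_accuracy(true_ids, predicted_ids):
    """Percentage of test utterances whose speaker was identified correctly."""
    if not true_ids or len(true_ids) != len(predicted_ids):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return 100.0 * correct / len(true_ids)


# Hypothetical labels: three of four utterances are identified correctly.
pia = identification_accuracy(["s1", "s2", "s3", "s4"],
                              ["s1", "s2", "s3", "s1"])
```

For the YOHO protocol described above, the denominator would be the 5520 test utterances.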

B. Speaker Identification Experiments and Results 

Experiments were performed using GMM based classifiers
of different model orders that are powers of two, i.e. 2, 4,
8, 16, etc. The number of Gaussians is limited by the amount
of available training data (average training speech length per
speaker after silence removal: (i) POLYCOST: 40 s and (ii)
YOHO: 150 s). The number of mixtures is increased up to 16
for the POLYCOST database and 64 for the YOHO database. We have
evaluated the performance of the ACRLAG feature as a front end
for the speaker identification task on both databases. Experiments
were conducted for different LP orders and lags to find the
optimal performance. An LP order of 12-20 is sufficient to
capture speaker-specific information from the residual signal.
An exhaustive search was also carried out to find the optimal
value of the lag. The lag was chosen so that it can effectively
capture the second order properties (autocorrelation) of one
pitch (or pitch-like information) in the residual signal.
Experimentally, we have observed that autocorrelation computation
with a lag of 10-16 is sufficient to represent the residual
information. In Table I the SI performance using ACRLAG is
shown for an LP order of 13 and 12 lags; hence the number of
residual features becomes 12 + 1 = 13. The performance attained
with the residual feature alone (vocal cord information) is
modest, but the feature contains useful complementary
information that cannot be explained by standard spectral
features. Hence, the LLRs of the two systems are combined as
stated in Sec. II-C. In our experiments the dimension of the
various spectral features is set to 19 for fair comparison; this
order is frequently used to model spectral information. The LP
based features (LPCC, LAR, LSF, PLPCC) are extracted using 19th
order all-pole modeling; the filterbank based features (LFCC and
MFCC), on the other hand, are extracted using 20 triangular
bandpass filters that are equally spaced either on the Hertz
scale (for LFCC) or on the Mel scale (for MFCC). In Tables II
and III the PIA of the baseline systems as well as the fused
systems (with $\eta$ = 0.5) are shown. The fused system performs
better across the different spectral features and GMM model
orders for both databases. The improvement is larger at lower
model orders than at higher model orders. This is due to the base
effect, which is usually experienced in the performance analysis
of newly proposed features for a classifier system. For example,
the performance of the PLPCC based SI system improves by 12% for
POLYCOST and 15.13% for YOHO, relative to the baseline, using
model order 2, whereas at the highest model order the
improvements are 4.74% and 1.47%, respectively. The improvement
on the POLYCOST database (telephonic) is also significantly
better than that on YOHO (microphonic) for various features.
In this work all voiced and unvoiced speech frames are
utilized for extracting residual features. Although only voiced
frames contain pitch and significant residual information,
unvoiced frames still contain information that is not considered
in autoregressive (AR) based approaches (LP). Filterbank based
approaches also remove pitch or pitch-like information when
warping with triangular filters. We also performed an experiment
using only voiced frames and observed that the performance is not
improved but rather degraded, owing to the smaller amount of
training data and the curse of dimensionality in GMM modeling.
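
The improvement figures quoted in this subsection can be checked against the tabulated accuracies. The sketch below assumes the percentages are relative gains over the baseline PIA (an interpretation, not stated explicitly in the text); the function name is illustrative, and the inputs are the PLPCC entries for 16 mixtures in Table II.

```python
def relative_improvement(baseline_pia, fused_pia):
    """Relative gain of the fused system over the baseline, in percent."""
    return 100.0 * (fused_pia - baseline_pia) / baseline_pia


# PLPCC on POLYCOST with 16 mixtures: baseline 78.3820, fused 82.0955.
gain = relative_improvement(78.3820, 82.0955)
print(round(gain, 2))  # prints 4.74
```

The result agrees with the 4.74% quoted for POLYCOST at the highest model order, which supports reading the figures as relative rather than absolute gains.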

It is worth noting that the performance of the combined
system, which effectively uses 19 + 13 = 32 dimensions, is
better than that of a system based on any single 32-dimensional
spectral feature.

TABLE I

Speaker Identification Results on POLYCOST and YOHO Database
Using ACRLAG Feature for Different Model Order of GMM
(ACRLAG Configuration: LP Order = 13, Number of Lag = 12).

Database    No. of Mixtures    Identification Accuracy (%)
POLYCOST     2                 45.7560
POLYCOST     4                 49.3369
POLYCOST     8                 55.7029
POLYCOST    16                 59.6817
YOHO         2                 41.2319
YOHO         4                 49.3116
YOHO         8                 56.9022
YOHO        16                 63.2246
YOHO        32                 68.9130
YOHO        64                 73.3514

TABLE II

Speaker Identification Results on POLYCOST Database Showing the
Performance of Baseline (Single Stream) System and Fused System.

Spectral Feature    No. of Mixtures    Baseline System (%)    Fused System (%)
LPCC                 2                 63.5279                71.2202
LPCC                 4                 74.5350                79.8408
LPCC                 8                 [illegible]            [illegible]
LPCC                16                 79.8408                83.0239
LAR                  2                 62.3342                [illegible]
LAR                  4                 72.5464                75.8621
LAR                  8                 [illegible]            [illegible]
LAR                 16                 78.6472                80.2387
LSF                  2                 60.7427                67.7719
LSF                  4                 66.8435                72.1485
LSF                  8                 75.7294                78.9125
LSF                 16                 78.1167                80.9019
PLPCC                2                 62.9973                70.5570
PLPCC                4                 72.2612                77.0557
PLPCC                8                 75.0663                80.1061
PLPCC               16                 78.3820                82.0955
LFCC                 2                 62.7321                72.1485
LFCC                 4                 74.9337                77.7188
LFCC                 8                 79.0451                82.3607
LFCC                16                 80.7692                84.2175
MFCC                 2                 63.9257                70.1592
MFCC                 4                 72.9443                75.7294
MFCC                 8                 77.8515                79.3103
MFCC                16                 77.8515                79.8408


IV. Conclusion 

In this paper we present a new scheme for representing
vocal cord characteristics. The vocal source feature is extracted
using the autocorrelation of the residual signal over a limited
range of lags. The complementarity of the proposed feature with
short-term spectral features is exploited through a fusion
technique. A linear combination of log-likelihood ratio scores is
formulated to exploit the advantages of the combined system. The
performance can be further enhanced with the help of advanced
fusion techniques and optimal selection of the LP order and lag.